Voice activity detection driven noise remediator.

In a method and apparatus for improving sound quality in a digital cellular radio system receiver, a voice activity detector (50) uses an energy estimate (from 210) to detect (in 230) the presence of speech in a received speech signal in a noise environment. When no speech is present, the system attenuates the signal (by 240, 270) and inserts low pass filtered white noise (by 270). In addition, a set of high pass filters (in 260) are used to filter the signal based upon the background noise level (from 220). This high pass filtering is applied to the signal regardless of whether speech is present. Thus, a combination of signal attenuation (in 270) with insertion of low pass filtered white noise (from 250) during periods of non-speech, along with high pass filtering (in 260) of the signal, improves sound quality when decoding speech which has been encoded in a noisy environment.
Field of the Invention

The present invention relates generally to digital mobile radio systems. In particular, this invention relates to improving the voice quality in a digital mobile radio receiver in the presence of audio background noise.

Background of the Invention

A cellular telephone system comprises three essential elements: a cellular switching system that serves as the gateway to the landline (wired) telephone network, a number of base stations under the switching system's control that contain equipment that translates between the signals used in the wired telephone network and the radio signals used for wireless communications, and a number of mobile telephone units that translate between the radio signals used to communicate with the base stations and the audible acoustic signals used to communicate with human users (e.g. speech, music, etc.).

Communication between a base station and a mobile telephone is possible only if both the base station and the mobile telephone use identical radio modulation schemes, data-encoding conventions, and control strategies, i.e. both units must conform to an air-interface specification. A number of standards have been established for air-interfaces in the United States. Until recently, all cellular telephony in the United States has operated according to the Advanced Mobile Phone Service (AMPS) standard. This standard specifies analog signal encoding using frequency modulation in the 800 MHz region of the radio spectrum. Under this scheme, each cellular telephone conversation is assigned a communications channel consisting of two 30 kHz segments of this region for the duration of the call. In order to avoid interference between conversations, no two conversations may occupy the same channel simultaneously within the same geographic area. Since the entire portion of the radio spectrum allocated to cellular telephony is finite, this restriction places a limit on the number of simultaneous users of a cellular telephone system.

In order to increase the capacity of the system, a number of alternatives to the AMPS standard have been introduced. One of these is the Interim Standard-54 (IS-54), issued by the Electronic Industries Association and the Telecommunications Industry Association. This standard makes use of digital signal encoding and modulation using a time division multiple access (TDMA) scheme. Under the TDMA scheme, each 30 kHz segment is shared by three simultaneous conversations, and each conversation is permitted to use the channel one-third of the time. Time is divided into 20 ms frames, and each frame is further sub-divided into three time slots. Each conversation is allotted one time slot per frame.

To permit all of the information describing 20 ms of conversation to be conveyed in a single time slot, speech and other audio signals are processed using a digital speech compression method known as Vector Sum Excited Linear Prediction (VSELP). Each IS-54 compliant base station and mobile telephone unit contains a VSELP encoder and decoder. Instead of transmitting a digital representation of the audio waveform over the channel, the VSELP encoder makes use of a model of human speech production to reduce the digitized audio signal to a set of parameters that represent the state of the speech production mechanism during the frame (e.g. the pitch, the vocal tract configuration, etc.). These parameters are encoded as a digital bit-stream, and are then transmitted over the channel to the receiver at 8 kilobits per second (kbs). This is a much lower bit rate than would be required to encode the actual audio waveform. The VSELP decoder at the receiver then uses these parameters to recreate an estimate of the digitized audio waveform. The transmitted digital speech data is organized into digital information frames of 20 ms, each containing 160 samples. There are 159 bits per speech frame. The VSELP method is described in detail in the document, TR45 Full-Rate Speech Codec Compatibility Standard PN-2972, 1990, published by the Electronic Industries Association, which is fully incorporated herein by reference (hereinafter referred to as "VSELP Standard").
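
The frame figures quoted above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch; the constant names are hypothetical and do not come from the VSELP Standard.

```python
# Illustrative arithmetic only; names are hypothetical, values are from the text.
FRAME_MS = 20            # duration of one digital information frame
SAMPLES_PER_FRAME = 160  # decoded speech samples per frame
BITS_PER_FRAME = 159     # encoded speech bits per frame

frames_per_second = 1000 // FRAME_MS                        # 50 frames each second
sample_rate_hz = SAMPLES_PER_FRAME * frames_per_second      # implied sampling rate
speech_bit_rate_bps = BITS_PER_FRAME * frames_per_second    # implied speech bit rate

print(sample_rate_hz, speech_bit_rate_bps)
```

Under these figures the decoded audio is sampled at 8000 Hz, and the 159 speech bits per frame amount to 7950 bits per second, consistent with the roughly 8 kilobits per second stated above.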

VSELP significantly reduces the number of bits required to transmit audio information over the communications channel. However, it achieves this reduction by relying heavily on a model of speech production. Consequently, it renders non-speech sounds poorly. For example, the interior of a moving automobile is an inherently noisy environment. The automobile's own sounds combine with external noises to create an acoustic background noise level much higher than is typically encountered in non-mobile environments. This situation forces VSELP to attempt to encode non-speech information much of the time, as well as combinations of speech and background noise.

Two problems arise when VSELP is used to encode speech in the presence of background noise. First, the background noise sounds unnatural whether or not there is speech present, and second, the speech is distorted in a characteristic way. Individually and collectively these problems are commonly referred to as "swirl".

While it would be possible to eliminate these artifacts introduced by the encoding/decoding process by replacing the VSELP algorithm with another speech compression algorithm that does not suffer from the same






EP0 665 530 A1 



deficiencies, this strategy would require changing the IS-54 Air Interface Specification. Such a change is undesirable because of the considerable investment in existing equipment on the part of cellular telephone service providers, manufacturers and subscribers. For example, in one prior art technique, the speech encoder detects when no speech is present and encodes a special frame to be transmitted to the receiver. This special frame contains comfort noise parameters which indicate that the speech decoder is to generate comfort noise which is similar to the background noise on the transmit side. These special frames are transmitted periodically by the transmitter during periods of non-speech. This proposed solution to the swirl problem requires a change to the current VSELP speech algorithm because it introduces special encoded frames to indicate when comfort noise is to be generated. It is implemented at both the transmit and receive sides of the communication channel, and requires a change in the current air interface specification standard. It is therefore an undesirable solution.

Summary of the Invention 

One object of the present invention is to reduce the severity of the artifacts introduced by VSELP (or any other speech coding/decoding algorithm) when used in the presence of acoustic background noise, without requiring any changes to the air interface specification.

It has been determined that a combination of signal attenuation with comfort noise insertion during periods of non-speech, and selective high pass filtering based on an estimate of the background noise energy, is an effective solution to the swirl problem discussed above.

In accordance with the present invention, a voice activity detector uses an energy estimate to detect the presence of speech in the received speech signal in a noise environment. When no speech is present, the system attenuates the signal and inserts low-pass filtered white noise (i.e. comfort noise) at an appropriate level. This comfort noise mimics the typical spectral characteristics of automobile or other background noise. This smooths out the swirl, making it sound natural. When speech is determined to be present in the signal by the voice activity detector, the synthesized speech signal is processed with no attenuation.

It has been determined that the perceptually annoying artifacts that the speech encoder introduces when trying to encode both speech and noise occur mostly in the lower frequency range. Therefore, in addition to the voice activity driven attenuation and comfort noise insertion, a set of high pass filters is used depending on the background noise level. This filtering is applied to the speech signal regardless of whether speech is present or not. If the noise level is found to be less than -52 dB, no high pass filtering is used. If the noise level is between -52 dB and -40 dB, a high pass filter with a cutoff frequency of 200 Hz is applied to the synthesized speech signal. If the noise level is greater than -40 dB, a high pass filter with a cutoff frequency of 350 Hz is applied. The result of these high pass filters is reduced background noise with little effect on the speech quality.

The invention described herein is employed at the receiver (either at the base station, the mobile unit, or both) and thus it may be implemented without the necessity of a change to the current standard speech encoding/decoding protocol.

Brief Description of the Drawings 

Fig. 1 is a block diagram of a digital radio receiving system incorporating the present invention.

Fig. 2 is a block diagram of the voice activity detection driven noise remediator in accordance with the present invention.

Fig. 3 is a waveform depicting the total acoustic energy of a received signal.

Fig. 4 is a block diagram of a high pass filter driver.

Fig. 5 is a flow diagram of the functioning of the voice activity detector.

Fig. 6 shows a block diagram of a microprocessor embodiment of the present invention.

Detailed Description 

A digital radio receiving system 10 incorporating the present invention is shown in Fig. 1. A demodulator 20 receives transmitted waveforms corresponding to encoded speech signals and processes the received waveforms to produce a digital signal d. This digital signal d is provided to a channel decoder 30 which processes the signal d to mitigate channel errors. The resulting signal generated by the channel decoder 30 is an encoded speech bit stream b organized into digital information frames in accordance with the VSELP standard discussed above in the background of the invention. This encoded speech bit stream b is provided to a speech decoder 40 which processes the encoded speech bit stream b to produce a decoded speech bit stream s. This speech decoder 40 is configured to decode speech which has been encoded in accordance with the VSELP technique. This decoded speech bit stream s is provided to a voice activity detection driven noise remediator (VADDNR) 50 to remove any background "swirl" present in the signal during periods of non-speech. In one embodiment, the VADDNR 50 also receives a portion of the encoded speech bit stream b from the channel decoder 30 over signal line 35. The VADDNR 50 uses the VSELP coded frame energy value r0 which is part of the encoded bit stream b, as discussed in more detail below. The VADDNR 50 generates a processed decoded speech bit stream output s". The output from the VADDNR 50 may then be provided to a digital to analog converter 60 which converts the digital signal s" to an analog waveform. This analog waveform may then be sent to a destination system, such as a telephone network. Alternatively, the output from the VADDNR 50 may be provided to another device that converts the VADDNR output to some other digital data format used by a destination system.

The VADDNR 50 is shown in greater detail in Fig. 2. The VADDNR receives the VSELP coded frame energy value r0 from the encoded speech bit stream b over signal line 35 as shown in Fig. 1. This energy value r0 represents the average signal power in the input speech over the 20 ms frame interval. There are 32 possible values for r0, 0 through 31. r0=0 represents a frame energy of 0. The remaining values for r0 range from a minimum of -64 dB, corresponding to r0=1, to a maximum of -4 dB, corresponding to r0=31. The step size between r0 values is 2 dB. The frame energy value r0 is described in more detail in VSELP Standard, p. 16. The coded frame energy value r0 is provided to an energy estimator 210 which determines the average frame energy.
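
The mapping from the coded value r0 to a level in dB follows directly from the figures above (r0=1 at -64 dB, 2 dB per step). A minimal sketch is given below; the function name `r0_to_db` is hypothetical, and representing the zero-energy code r0=0 as negative infinity is an assumption made for illustration, not something the VSELP Standard is stated to do.

```python
def r0_to_db(r0):
    """Map a coded frame energy value r0 (0..31) to a level in dB.

    Assumption for illustration: r0 = 0 (frame energy of 0) is returned
    as negative infinity.
    """
    if not 0 <= r0 <= 31:
        raise ValueError("r0 must lie in the range 0..31")
    if r0 == 0:
        return float("-inf")
    # r0 = 1 corresponds to -64 dB and each further step adds 2 dB.
    return -64.0 + 2.0 * (r0 - 1)


print(r0_to_db(1), r0_to_db(7), r0_to_db(13), r0_to_db(31))
```

With this mapping r0=31 lands on -4 dB, matching the stated maximum.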

The energy estimator 210 generates an average frame energy signal e[m] which represents the average frame energy computed during a frame m, where m is a frame index which represents the current digital information frame. e[m] is defined as:

$$
e[m] = \begin{cases} \text{Einit} & \text{for } m = 0 \\ \alpha \cdot r0[m] + (1-\alpha) \cdot e[m-1] & \text{for } m > 0 \end{cases}
$$

The average frame energy is initially set to an initial energy estimate Einit. Einit is set to a value greater than 31, which is the largest possible value for r0. For example, Einit could be set to a value of 32. After initialization, the average frame energy e[m] will be calculated by the equation e[m] = α * r0[m] + (1-α) * e[m-1], where α is a smoothing constant with 0 ≤ α ≤ 1. α should be chosen to provide acceptable frame averaging. We have found a value of α = 0.25 to be optimal, giving effective frame averaging over seven frames of digital information (140 ms). Different values of α could be chosen, with the value preferably being in the range of 0.25 ± 0.2.
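
The recursion for e[m] can be sketched in a few lines. This is a minimal illustration under the stated preferred values (smoothing constant 0.25, Einit of 32); the function and parameter names are hypothetical.

```python
def average_frame_energy(r0_values, alpha=0.25, e_init=32.0):
    """Return e[m] for each frame, given the coded frame energies r0[m].

    e[0] is the initial estimate Einit; for m > 0 the estimate is an
    exponentially weighted average of r0[m] and the previous e[m-1].
    """
    if not r0_values:
        return []
    e = e_init
    averaged = [e]                      # e[0] = Einit, independent of r0[0]
    for r0 in r0_values[1:]:
        e = alpha * r0 + (1.0 - alpha) * e
        averaged.append(e)
    return averaged


print(average_frame_energy([10, 10, 10]))
```

Because Einit lies above every possible r0 value, the estimate starts high and decays toward the observed frame energies over the first several frames.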

As discussed above, and as shown in Fig. 1, the VADDNR 50 receives the VSELP coded frame energy value r0 from the encoded speech bit stream signal b prior to the signal b being decoded by the speech decoder 40. Alternatively, this frame energy value r0 could be calculated by the VADDNR 50 itself from the decoded speech bit stream signal s received from the speech decoder 40. In an embodiment where the frame energy value r0 is calculated by the VADDNR 50, there is no need to provide any part of the encoded speech bit stream b to the VADDNR 50, and signal line 35 shown in Fig. 1 would not be present. Instead, the VADDNR 50 would process only the decoded speech bit stream s, and the frame energy value r0 would be calculated as described in VSELP Standard, pp. 16 - 17. However, by providing r0 to the VADDNR 50 from the encoded bit stream b over signal line 35, the VADDNR can process the decoded speech bit stream s more quickly because it does not have to calculate r0.

The average frame energy signal e[m] produced by the energy estimator 210 represents the average total acoustic energy present in the received speech signal. This total acoustic energy may be comprised of both speech and noise. As an example, Fig. 3 shows a waveform depicting the total acoustic energy of a typical received signal 310 over time T. In a mobile environment, there will typically be a certain level of ambient background noise. The energy level of this noise is shown in Fig. 3 as e1. When speech is present in the signal 310, the acoustic energy level will represent both speech and noise. This is shown in Fig. 3 in the range where energy > e2. During time interval t1 speech is not present in the signal 310 and the acoustic energy during this time interval t1 represents ambient background noise only. During time interval t2, speech is present in the signal 310 and the acoustic energy during this time interval t2 represents ambient background noise plus speech.
Referring to Fig. 2, the output signal e[m] produced by the energy estimator 210 is provided to a noise estimator 220 which determines the average background noise level in the decoded speech bit stream s. The noise estimator 220 generates a signal N[m] which represents a noise estimate value, where:

$$
N[m] = \begin{cases} \text{Ninit} & \text{for } m = 0 \\ N[m-1] & \text{for } e[m] > N[m-1] + \text{Nthresh} \\ \beta \cdot e[m] + (1-\beta) \cdot N[m-1] & \text{otherwise} \end{cases}
$$

Initially, N[m] is set to the initial value Ninit, which is an initial noise estimate. During further processing, the value N[m] will increase or decrease based upon the actual background noise present in the decoded speech bit stream s. Ninit is set to a level which is on the boundary between moderate and severe background noise. Initializing N[m] to this level permits N[m] to adapt quickly in either direction as determined by the actual background noise. We have found that in a mobile environment it is preferable to set Ninit to an r0 value of 13.

The speech component of signal energy should not be included in calculating the average background noise level. For example, referring to Fig. 3, the energy level present in the signal 310 during time interval t1 should be included in calculating the noise estimate N[m], but the energy level present in the signal 310 during time interval t2 should not be included because the energy during time interval t2 represents both background noise and speech.

Thus, any average frame energy e[m] received from the energy estimator 210 which represents both speech and noise should be excluded from the calculation of the noise estimate N[m] in order to prevent the noise estimate N[m] from becoming biased. In order to exclude average frame energy e[m] values which represent both speech and noise, an upper noise clipping threshold, Nthresh, is used. Thus, as stated above, if e[m] > N[m-1] + Nthresh then N[m] = N[m-1]. In other words, if the current frame's average frame energy, e[m], is greater than the prior frame's noise estimate, N[m-1], by more than Nthresh, i.e. speech is present, then N[m] is not changed from the previous frame's calculation. Thus, if there is a large increase of frame energy over a short time period, then it is assumed that this increase is due to the presence of speech and the energy is not included in the noise estimate. We have found it optimal to set Nthresh to the equivalent of a frame energy r0 value of 2.5. This limits the operational range of the noise estimate algorithm to conditions with better than 5 dB audio signal to noise ratio, since r0 is scaled in units of 2 dB. Nthresh could be set anywhere in the range of 2 to 4 for acceptable performance of the noise estimator 220.

If there is not a large increase of frame energy over a short time period, then the noise estimate is determined by the equation N[m] = β * e[m] + (1-β) * N[m-1], where β is a smoothing constant which should be set to provide acceptable frame averaging. A value of 0.05 for β, which gives frame averaging over 25 frames (500 ms), has been found preferable. The value of β should generally be set in the range of 0.025 ≤ β ≤ 0.1.
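
The per-frame noise update described above can be sketched as follows. This is an illustrative reading of the three-case definition, using the preferred values from the text (smoothing constant 0.05, Nthresh of 2.5, Ninit of 13); the function and parameter names are hypothetical.

```python
N_INIT = 13.0  # preferred initial noise estimate, in r0 units


def update_noise_estimate(e_m, n_prev, beta=0.05, n_thresh=2.5):
    """Return N[m] given the current average frame energy and N[m-1]."""
    if e_m > n_prev + n_thresh:
        # Energy jumped well above the noise floor: assume speech is
        # present and hold the previous estimate.
        return n_prev
    # Otherwise track the background level with slow smoothing.
    return beta * e_m + (1.0 - beta) * n_prev


noise = N_INIT
for energy in [12.0, 12.0, 20.0, 21.0, 12.5]:
    noise = update_noise_estimate(energy, noise)
print(noise)
```

In this short run the two high-energy frames (20.0 and 21.0) leave the estimate untouched, while the quieter frames pull it slowly downward.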

The noise estimate value N[m] calculated by the noise estimator 220 is provided to a high pass filter driver 260 which operates on the decoded bit stream signal s provided from the speech decoder 40. As discussed above, each digital information frame contains 160 samples of speech data. The high pass filter driver 260 operates on each of these samples s[i], where i is a sampling index. The high pass filter driver 260 is shown in further detail in Fig. 4. The noise estimate value N[m] generated by the noise estimator 220 is provided to logic block 410 which contains logic circuitry to determine which of a set of high pass filters will be used to filter each sample s[i] of the decoded speech bit stream s. There are two high pass filters 430 and 440. Filter 430 has a cutoff frequency at 200 Hz and filter 440 has a cutoff frequency at 350 Hz. These cutoff frequencies have been determined to provide optimal results; however, other values may be used in accordance with the present invention. The difference in cutoff frequencies between the filters should preferably be at least 100 Hz. In order to determine which filter should be used, the logic block 410 of the high pass filter driver 260 compares the noise estimate value N[m] with two thresholds. The first threshold is set to a value corresponding to a frame energy value r0=7 (corresponding to -52 dB), and the second threshold is set to a value corresponding to a frame energy value r0=13 (corresponding to -40 dB). If the noise estimate N[m] is less than r0=7, then there is no high pass filtering applied. If the noise estimate value N[m] is greater than or equal to r0=7 and less than r0=13, then the 200 Hz high pass filter 430 is applied. If the noise estimate value N[m] is greater than or equal to r0=13, then the 350 Hz high pass filter 440 is applied. The logic for determining the high pass filtering to be applied can be summarized as:
$$
\text{filter} = \begin{cases} \text{all pass} & \text{for } N[m] < 7 \\ \text{high pass at 200 Hz} & \text{for } 7 \le N[m] < 13 \\ \text{high pass at 350 Hz} & \text{for } N[m] \ge 13 \end{cases}
$$

With reference to Fig. 4, this logic is carried out by logic block 410. Logic block 410 will determine which filter is to be applied based upon the above rules and will provide a control signal c[m] to two cross bar switches 420, 450. A control signal corresponding to a value of 0 indicates that no high pass filtering should be applied. A control signal corresponding to a value of 1 indicates that the 200 Hz high pass filter should be applied. A control signal corresponding to a value of 2 indicates that the 350 Hz high pass filter should be applied.
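
The decision made by logic block 410 can be sketched as a small function that maps the noise estimate to the control signal c[m]. The function name is hypothetical; the thresholds are the r0 values 7 and 13 given in the text.

```python
def select_filter(noise_estimate):
    """Return the control signal c[m] for a noise estimate in r0 units.

    0: no high pass filtering, 1: 200 Hz high pass, 2: 350 Hz high pass.
    """
    if noise_estimate < 7:
        return 0
    if noise_estimate < 13:
        return 1
    return 2


print([select_filter(n) for n in (3.0, 7.0, 12.9, 13.0, 20.0)])
```

Note that with Ninit set to an r0 value of 13, this rule selects the 350 Hz filter until the noise estimate has adapted downward.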

The signal s[i] is provided to the cross bar switch 420 from the speech decoder 40. The cross bar switch 420 directs the signal s[i] to the appropriate signal line 421, 422, 423 to select the appropriate filtering. A control signal of 0 will direct signal s[i] to signal line 421. Signal line 421 will provide the signal s[i] to cross bar switch 450 with no filtering being applied. A control signal of 1 will direct signal s[i] to signal line 422, which is connected to high pass filter 430. After the signal s[i] is filtered by high pass filter 430, it is provided to cross bar switch 450 over signal line 424. A control signal of 2 will direct signal s[i] to signal line 423, which is connected to high pass filter 440. After the signal s[i] is filtered by high pass filter 440, it is provided to cross bar switch 450 over signal line 425. The control signal c[m] is also provided to the cross bar switch 450. Based upon the control signal c[m], cross bar switch 450 will provide one of the signals from signal line 421, 424, 425 to the speech attenuator 270. This signal produced by the high pass filter driver 260 is identified as s'[i]. Those skilled in the art will recognize that any number of high pass filters or a single high pass filter with a continuously adjustable cutoff frequency could be used in the high pass filter driver 260 to filter the decoded bit stream s. Use of a larger number of high pass filters or a single high pass filter with a continuously adjustable cutoff frequency would make the transitions between filter selections less noticeable.
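
The routing performed by the two cross bar switches can be illustrated in software. The text does not specify the design of filters 430 and 440, so the sketch below substitutes a simple first-order (RC-style) high pass filter purely as an assumption; the 8000 Hz sampling rate follows from 160 samples per 20 ms frame. All names are hypothetical, and the filter state is reset on every call, whereas a practical implementation would presumably carry state across frames.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate_hz=8000.0):
    """First-order high pass filter (assumed design, not from the text)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    a = rc / (rc + dt)
    filtered = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = a * (prev_y + x - prev_x)   # standard discrete RC high pass step
        filtered.append(y)
        prev_x, prev_y = x, y
    return filtered


def filter_frame(samples, c_m):
    """Route one frame according to the control signal c[m]."""
    if c_m == 0:
        return list(samples)            # line 421: no filtering
    cutoff = 200.0 if c_m == 1 else 350.0
    return high_pass(samples, cutoff)   # filter 430 or filter 440


dc_frame = [1.0] * 160
print(filter_frame(dc_frame, 0)[-1], abs(filter_frame(dc_frame, 2)[-1]) < 1e-6)
```

A constant (zero-frequency) input passes unchanged when c[m] = 0 and decays toward zero when either high pass filter is selected, which is the qualitative behaviour the text relies on for suppressing low-frequency artifacts.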

Referring to Fig. 2, the signal s'[i] produced by the high pass filter driver 260 is provided to a speech attenuator/comfort noise inserter 270. The speech attenuator/comfort noise inserter 270 will process the signal s'[i] to produce the processed decoded speech bit stream output signal s"[i]. The speech attenuator/comfort noise inserter 270 also receives input signal n[i] from a shaped noise generator 250 and input signal atten[m] from an attenuator calculator 240. The functioning of the speech attenuator/comfort noise inserter 270 will be discussed in detail below, following a discussion of how its inputs n[i] and atten[m] are calculated.

The noise estimate N[m] produced by the noise estimator 220, and the average frame energy e[m] produced by the energy estimator 210, are provided to the voice activity detector 230. The voice activity detector 230 determines whether or not speech is present in the current frame of the speech signal and produces a voice detection signal v[m] which indicates whether or not speech is present. A value of 0 for v[m] indicates that there is no voice activity detected in the current frame of the speech signal. A value of 1 for v[m] indicates that voice activity is detected in the current frame of the speech signal. The functioning of the voice activity detector 230 is described in conjunction with the flow diagram of Fig. 5. In step 505, the voice activity detector 230 will determine whether e[m] < N[m] + Tdetect, where Tdetect is a lower noise detection threshold, and is similar in function to the Nthresh value discussed above in conjunction with Fig. 3. The assumption is made that speech may only be present when the average frame energy e[m] is greater than the noise estimate value N[m] by some value, Tdetect. Tdetect is preferably set to an r0 value of 2.5 which means that speech may only be present if the average frame energy e[m] is greater than the noise estimate value N[m] by 5 dB. Other values may also be used. The value of Tdetect should generally be within the range 2.5 ± 0.5.

In order to prevent the voice activity detector 230 from declaring no voice activity within words, an undetected frame counter Ncnt is used. Ncnt is initialized to zero and is set to count up to a threshold, Ncntthresh, which represents the number of frames containing no voice activity which must be present before the voice activity detector 230 declares that no voice activity is present. Ncntthresh may be set to a value of six. Thus, only if no speech is detected for six frames (120 ms) will the voice activity detector 230 declare no voice. Returning now to Fig. 5, if step 505 determines that e[m] < N[m] + Tdetect, i.e. the average energy e[m] is less than that for which it has been determined that speech may be present, then Ncnt is incremented by one in step 510. If step 515 determines that Ncnt ≥ Ncntthresh, i.e. that there have been 6 frames in which no speech has been detected, then v[m] is set to 0 in step 530 to indicate no speech for the current frame. If step 515 determines that Ncnt < Ncntthresh, i.e. that there have not yet been 6 frames in which no speech has been detected, then v[m] is set to 1 in step 520 to indicate there is speech present in the current frame. If step 505 determines that e[m] ≥ N[m] + Tdetect, i.e. the average energy e[m] is greater than or equal to that for which it has been determined that speech may be present, then Ncnt is set to zero in step 525 and v[m] is set to one in step 520 to indicate that there is speech present in the current frame.
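
The flow of Fig. 5 as described above can be sketched as a small stateful class. The class and attribute names are hypothetical; the step numbers in the comments refer to the steps named in the text, and the defaults are the stated preferred values (Tdetect of 2.5, Ncntthresh of six).

```python
class VoiceActivityDetector:
    """Sketch of the voice activity decision with an undetected frame counter."""

    def __init__(self, t_detect=2.5, ncnt_thresh=6):
        self.t_detect = t_detect
        self.ncnt_thresh = ncnt_thresh
        self.ncnt = 0  # undetected frame counter Ncnt

    def update(self, e_m, n_m):
        """Return v[m] (1 = speech, 0 = no speech) for the current frame."""
        if e_m < n_m + self.t_detect:           # step 505
            self.ncnt += 1                      # step 510
            if self.ncnt >= self.ncnt_thresh:   # step 515
                return 0                        # step 530
            return 1                            # step 520
        self.ncnt = 0                           # step 525
        return 1                                # step 520


vad = VoiceActivityDetector()
print([vad.update(10.0, 13.0) for _ in range(7)])
```

The counter acts as a hangover: five consecutive low-energy frames are still reported as speech, and only from the sixth onward does the detector report no voice activity.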

The voice detection signal v[m] produced by the voice activity detector 230 is provided to the attenuator calculator 240, which produces an attenuation signal, atten[m], which represents the amount of attenuation of the current frame. The attenuation signal atten[m] is updated every frame, and its value depends in part upon whether or not voice activity was detected by the voice activity detector 230. The signal atten[m] will represent some value between 0 and 1. The closer to 1, the less the attenuation of the signal, and the closer to 0, the more the attenuation of the signal. The maximum attenuation to be applied is defined as maxatten, and it has been determined that the optimal value for maxatten is 0.65 (i.e., -3.7 dB). Other values for maxatten may be used however, with the value generally being in the range 0.3 to 0.8. The factor by which the attenuation of the speech signal is increased is defined as attenrate, and the preferred value for attenrate has been found to be 0.98. Other values may be used for attenrate however, with the value generally in the range of 0.95 ± 0.04.

In this section, we describe the calculation of the attenuation signal atten[m]. The use of atten[m] in attenuating the signal s'[i] will become clear during the discussion below in conjunction with the speech attenuator/comfort noise inserter 270. The attenuation signal atten[m] is calculated as follows. Initially, the attenuation signal atten[m] is set to 1. Following this initialization, atten[m] will be calculated based upon whether speech is present, as determined by the voice activity detector 230, and whether the attenuation has reached the maximum attenuation as defined by maxatten. If v[m] = 1, i.e. speech is detected, then atten[m] is set to 1. If v[m] = 0, i.e. no speech is detected, and if the attenuation factor applied to the previous frame's attenuation (attenrate * atten[m-1]) is greater than the maximum attenuation, then the current frame attenuation is calculated by applying the attenuation factor to the previous frame's attenuation. If v[m] = 0, i.e. no speech is detected, and if the attenuation factor applied to the previous frame's attenuation is less than or equal to the maximum attenuation, then the current frame attenuation is set to the maximum attenuation. This calculation of the current frame attenuation is summarized as:
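$$
\text{atten}[m] = \begin{cases} 1.0 & \text{for } m = 0 \text{ or } v[m] = 1 \\ \text{attenrate} \cdot \text{atten}[m-1] & \text{for } \text{attenrate} \cdot \text{atten}[m-1] > \text{maxatten} \text{ and } v[m] = 0 \\ \text{maxatten} & \text{for } \text{attenrate} \cdot \text{atten}[m-1] \le \text{maxatten} \text{ and } v[m] = 0 \end{cases}
$$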



Thus, when no speech is detected by the voice activity detector 230, the attenuation signal atten[m] is reduced from 1 to 0.65 (maxatten) by a constant factor of 0.98 per frame. The current frame attenuation signal, atten[m], generated by the attenuator calculator 240 is provided to the speech attenuator/comfort noise inserter 270.
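
The attenuation update can be sketched as a single function applied once per frame. The names are hypothetical; the defaults are the preferred values of attenrate (0.98) and maxatten (0.65) given in the text.

```python
def update_attenuation(v_m, atten_prev, attenrate=0.98, maxatten=0.65):
    """Return atten[m] from the voice detection signal and atten[m-1]."""
    if v_m == 1:
        return 1.0                      # speech present: no attenuation
    candidate = attenrate * atten_prev
    # Decay geometrically, but never below the maximum attenuation floor.
    return candidate if candidate > maxatten else maxatten


atten = 1.0                             # atten[0]
frames_to_floor = 0
while atten > 0.65:
    atten = update_attenuation(0, atten)
    frames_to_floor += 1
print(frames_to_floor, atten)
```

With these values the floor of 0.65 is reached on the 22nd consecutive non-speech frame, i.e. after about 440 ms at 20 ms per frame, so the transition into full attenuation is gradual rather than abrupt.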

The speech attenuator/comfort noise inserter 270 also receives the signal n[i], which represents low-pass filtered white noise, from the shaped noise generator 250. This low pass filtered white noise is also referred to as comfort noise. The shaped noise generator 250 receives the noise estimate N[m] from the noise estimator 220 and generates the signal n[i] which represents the shaped noise as follows:

$$
n[i] = \varepsilon \cdot wn[i] + (1 - \varepsilon) \cdot n[i-1]
$$

where

$$
wn[i] = \delta \cdot \text{dB2lin}(N[m]) \cdot \text{ran}[i]
$$

and i is the sampling index as discussed above. Thus, n[i] is generated for each sample in the current frame. The function dB2lin maps the noise estimate N[m] from a dB to a linear value. The scale factor δ is set to a value of 1.7 and the filter coefficient ε is set to a value of 0.1. The function ran[i] generates a random number between -1.0 and 1.0. Thus, the noise is scaled using the noise estimate N[m] and then filtered by a low pass filter. The above stated values for the scale factor δ and the filter coefficient ε have been found to be optimal. Other values may be used however, with the value of δ generally in the range 1.5 to 2.0, and the value of ε generally in the range 0.05 to 0.15.
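
The shaped noise recursion can be sketched as below. The text does not define dB2lin precisely, so the version here is an assumption: it converts the noise estimate from r0 units to dB (r0=1 at -64 dB, 2 dB per step) and then to a linear amplitude relative to a full scale of 1.0. The function names, the `rng` parameter and the carried-in state `n_prev` are likewise hypothetical.

```python
import random


def db2lin(noise_estimate):
    """Assumed mapping: r0 units -> dB -> linear amplitude (full scale 1.0)."""
    level_db = -64.0 + 2.0 * (noise_estimate - 1.0)
    return 10.0 ** (level_db / 20.0)


def shaped_noise_frame(noise_estimate, n_prev=0.0, scale=1.7, coeff=0.1,
                       frame_len=160, rng=None):
    """Generate one frame of low pass filtered white noise n[i]."""
    if rng is None:
        rng = random.Random()
    gain = scale * db2lin(noise_estimate)
    frame = []
    for _ in range(frame_len):
        wn = gain * rng.uniform(-1.0, 1.0)            # scaled white noise wn[i]
        n_prev = coeff * wn + (1.0 - coeff) * n_prev  # one-pole low pass
        frame.append(n_prev)
    return frame


frame = shaped_noise_frame(13.0, rng=random.Random(0))
print(len(frame), max(abs(x) for x in frame) <= 1.7 * db2lin(13.0))
```

Because each output sample is a weighted average of the previous output and a bounded random value, the comfort noise never exceeds the scaled noise level in magnitude, and its energy is concentrated at low frequencies.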

The low-pass filtered white noise n[i] generated by the shaped noise generator 250 and the current frame's
attenuation atten[m] generated by the attenuator calculator 240 are provided to the speech attenuator/comfort
noise inserter 270. The speech attenuator receives the high pass filtered signal s'[i] from the high pass filter
driver 260 and generates the processed decoded speech bit stream s'' according to the following equation:

$$
s''[i] = \mathrm{atten}[m] \cdot s'[i] + (1 - \mathrm{atten}[m]) \cdot n[i], \qquad i = 0, 1, \ldots, 159
$$

Thus, for each sample s'[i] in the high pass filtered speech signal s', the speech attenuator/comfort noise
inserter 270 will attenuate the sample s'[i] by the current frame's attenuation atten[m]. At the same time, the
speech attenuator/comfort noise inserter 270 will also insert the low pass filtered white noise n[i] based on the
value of atten[m]. As can be seen from the above equation, if atten[m] = 1, then there will be no attenuation
and s''[i] = s'[i]. If atten[m] = maxatten (0.65), then s''[i] = (0.65 · high pass filtered speech signal) + (0.35 · low
pass filtered white noise). The effect of the attenuation of the signal s'[i] plus the insertion of low pass filtered
white noise (comfort noise) is to provide a smoother background noise with less perceived swirl. The signal
s''[i] generated by the speech attenuator/comfort noise inserter 270 may be provided to the digital to analog
converter 60, or to another device that converts the signal to some other digital data format, as discussed
above.
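
The per-sample mixing performed on each frame can be sketched in Python as follows. This is a minimal illustration, not the implementation of the speech attenuator/comfort noise inserter 270: the function name and the use of plain Python lists for the frames are assumptions.

```python
def insert_comfort_noise(hp_frame, noise_frame, atten):
    """Mix one frame of high pass filtered speech s'[i] with comfort noise n[i].

    Hypothetical helper implementing
        s''[i] = atten[m] * s'[i] + (1 - atten[m]) * n[i]
    for every sample of the current frame.
    """
    if len(hp_frame) != len(noise_frame):
        raise ValueError("speech and noise frames must have the same length")
    return [atten * s + (1.0 - atten) * n
            for s, n in zip(hp_frame, noise_frame)]


if __name__ == "__main__":
    speech = [100.0, -50.0, 25.0]
    noise = [10.0, -10.0, 0.0]
    print(insert_comfort_noise(speech, noise, 1.0))   # speech passes unchanged
    print(insert_comfort_noise(speech, noise, 0.65))  # fully attenuated frame
```

With atten[m] = 1 the output equals the input frame, and with atten[m] = 0.65 each output sample is 65% speech and 35% comfort noise, matching the two cases described above.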

As discussed above, the attenuator calculator 240, the shaped noise generator 250, and the speech
attenuator/comfort noise inserter 270 operate in conjunction to reduce the background swirl when no speech is
present in the received signal. These elements could be considered as a single noise remediator, which is
shown in Fig. 2 within the dotted lines as 280. This noise remediator 280 receives the voice detection signal
v[m] from the voice activity detector 230, the noise estimate N[m] from the noise estimator 220, and the high
pass filtered signal s'[i] from the high pass filter driver 260, and generates the processed decoded speech bit
stream s''[i] as discussed above.

A suitable VADDNR 50 as described above could be implemented in a microprocessor as shown in Fig.
6. The microprocessor (µP) 610 is connected to a non-volatile memory 620, such as a ROM, by a data line 621
and an address line 622. The non-volatile memory 620 contains program code to implement the functions of
the VADDNR 50 as discussed above. The microprocessor 610 is also connected to a volatile memory 630,
such as a RAM, by data line 631 and address line 632. The microprocessor 610 receives the decoded speech
bit stream s from the speech decoder 40 on signal line 612, and generates a processed decoded speech bit
stream s''. As discussed above, in one embodiment of the present invention, the VSELP coded frame energy
value r0 is provided to the VADDNR 50 from the encoded speech bit stream b. This is shown in Fig. 6 by the
signal line 611. In an alternate embodiment, the VADDNR calculates the frame energy value r0 from the
decoded speech bit stream s, and signal line 611 would not be present.

It is to be understood that the embodiments and variations shown and described herein are illustrative of
the principles of the invention only and that various modifications may be implemented by those skilled in the
art without departing from the scope and spirit of the invention. Throughout this description, various preferred
values, and ranges of values, have been disclosed. However, it is to be understood that these values are related
to the use of the present invention in a mobile environment. Those skilled in the art will recognize that the
invention disclosed herein may be utilized in various environments, in which case values, and ranges of values,
may vary from those discussed herein. Such use of the present invention in various environments along with
the variations of values are within the contemplated scope of the present invention.


Claims 

1. An apparatus for processing a received signal, said signal comprising a speech component and a noise
component, said apparatus comprising:

an energy estimator for generating an energy signal representing the acoustic energy of said received
signal;

a noise estimator for receiving said energy signal and for generating a noise estimate signal
representing the average background noise in said received signal;

a voice activity detector for receiving said noise estimate signal and said energy signal and for
generating a voice detection signal representing whether speech is present in said received signal; and

a noise remediator responsive to said noise estimate signal and said voice detection signal for
processing said received signal when said voice detection signal indicates that speech is not present in said
received signal and for generating a processed signal,

wherein said processed signal comprises:

a first component comprising an attenuated received signal; and

a second component comprising a comfort noise signal.

2. The apparatus of claim 1 wherein said voice detector generates a voice detection signal indicating that
speech is not present only when no speech is detected in said received signal for a predetermined period
of time.

3. The apparatus of claim 1 wherein said comfort noise comprises low pass filtered white noise.

4. The apparatus of claim 1 wherein said noise remediator further comprises:

an attenuator calculator for receiving said voice detection signal and for generating an attenuation
signal representing the attenuation to be applied to said received signal;

a shaped noise generator for receiving said noise estimate signal and for generating said comfort
noise signal; and

a speech attenuator/comfort noise inserter responsive to said comfort noise signal and said
attenuation signal for receiving said received signal and for attenuating said received signal and inserting said
comfort noise signal into said received signal.

5. The apparatus of claim 4 wherein said comfort noise signal represents low pass filtered white noise scaled
based upon said noise estimate signal.

6. A method for processing a received signal representing speech and noise, said method comprising the
steps of:

generating an energy signal representing the acoustic energy of said received signal;

generating a noise estimate signal representing the average background noise in said received
signal; and

generating a high pass filtered signal by applying said received signal to one of a plurality of high
pass filters based upon said noise estimate signal.

7. The method of claim 6 wherein the difference in the cutoff frequencies of each of said plurality of high
pass filters is at least 100 Hz.

8. The method of claim 6 further comprising the steps of:

generating a voice detection signal based upon said energy signal and said noise estimate signal,
said voice detection signal indicating whether said received signal contains a speech component; and

generating a processed high pass filtered signal if said voice detection signal indicates that said
received signal does not contain a speech component.

9. The method of claim 8 wherein said step of generating a processed high pass filtered signal further
comprises the steps of:

generating a comfort noise signal based upon said noise estimate signal;

attenuating said high pass filtered signal; and

inserting said comfort noise signal into said high pass filtered signal.

10. The method of claim 9 wherein said comfort noise signal comprises low pass filtered white noise scaled
based upon said noise estimate signal.

11. A method for processing a received signal representing speech and noise, said method comprising the
steps of:

generating an energy value representing the acoustic energy of said received signal;

generating a noise estimate value representing the average background noise in said received
signal;

generating a high pass filtered signal by applying said received signal to one of a plurality of high
pass filters based upon said noise estimate value;

generating comfort noise based on said noise estimate value;

determining whether said received signal contains a speech component based upon said energy
value and said noise estimate value; and

generating a processed high pass filtered signal if said received signal does not contain a speech
component.

12. The method of claim 11 wherein the difference in the cutoff frequencies of each of said plurality of high
pass filters is at least 100 Hz.

13. The method of claim 11 wherein said step of generating a processed high pass filtered signal further
comprises the steps of:

attenuating said high pass filtered signal; and inserting said comfort noise into said high pass
filtered signal.

14. An apparatus for processing a received encoded signal representing speech and noise, said apparatus
comprising:

means for receiving said encoded signal;

means for decoding said encoded signal into a decoded signal;

means for generating an energy value representing the acoustic energy of said decoded signal;

means for generating a noise estimate value representing the average background noise level in
said decoded signal;

means for determining whether said decoded signal contains a speech component based upon said
energy value and said noise estimate value; and

means for generating a processed decoded signal if the decoded signal does not contain a speech
component for a predetermined period of time, said processed decoded signal comprising an attenuated
decoded signal component and a comfort noise component.

15. An apparatus for processing a received signal, said received signal comprising a speech component and
a noise component, said apparatus comprising:

means for generating an energy value representing the acoustic energy of said received signal;

means for generating a noise estimate value representing the average background noise in said
received signal; and

means for generating a high pass filtered signal by applying said received signal to one of a plurality
of high pass filters based upon said noise estimate value.

16. The apparatus of claim 15 wherein the difference in the cutoff frequencies of each of said plurality of high
pass filters is at least 100 Hz.

17. The apparatus of claim 15 further comprising:

means for determining whether said received signal contains a speech component; and

means for generating a processed high pass filtered signal if said received signal does not contain
a speech component.

18. The apparatus of claim 17 wherein said means for generating a processed high pass filtered signal further
comprises:

means for generating comfort noise based on said noise estimate value;

means for attenuating said high pass filtered signal; and means for inserting said comfort noise
into said high pass filtered signal.